Introduction to Preparing Private Data for RAG

RAG Fundamentals

Standard Large Language Models (LLMs) are "frozen" in time, limited by their training data cut-off. They cannot answer questions about your company’s internal handbook or a private video meeting from yesterday. Retrieval-Augmented Generation (RAG) bridges this gap by supplying relevant context retrieved from your own private data.

The Multi-Step Workflow

To make private data "readable" by an LLM, we follow a specific pipeline:

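Loading: Convert diverse formats (PDF, web pages, YouTube, etc.) into a standard document format.
Splitting: Break long documents into smaller, more manageable "chunks".
Embedding: Convert each text chunk into a numeric vector (a mathematical representation of its meaning).
Storage: Save the vectors in a vector store (e.g., Chroma) so that similarity search is fast.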
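
The four steps above can be sketched end to end in plain Python. This is a toy illustration only: `embed`, `cosine`, and `ToyVectorStore` are hypothetical names invented here, the "embedding" is a simple word-count vector standing in for a real embedding model, and the in-memory list stands in for a real vector store such as Chroma.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): a bag-of-words count vector, not a real model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector store."""

    def __init__(self):
        self.rows = []  # list of (vector, chunk_text) pairs

    def add(self, chunks):
        for chunk in chunks:
            self.rows.append((embed(chunk), chunk))

    def search(self, query, k=1):
        query_vector = embed(query)
        ranked = sorted(
            self.rows,
            key=lambda row: cosine(query_vector, row[0]),
            reverse=True,
        )
        return [chunk for _, chunk in ranked[:k]]


# 1. Loading: here the "document" is already plain text.
document = (
    "Vacation policy: employees get 20 days per year. "
    "Expense policy: submit receipts within 30 days."
)

# 2. Splitting: a naive split on sentence boundaries.
chunks = [part.strip() for part in document.split(". ") if part.strip()]

# 3 and 4. Embedding and storage happen inside ToyVectorStore.add.
store = ToyVectorStore()
store.add(chunks)

# Retrieval: find the chunk most similar to the question.
top = store.search("how many vacation days do employees get", k=1)
print(top)
```

In a real pipeline the retrieved chunk would then be placed in the LLM prompt as context; only the matching policy sentence is sent, not the whole document.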

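Why Does Chunking Matter?

LLMs have a "context window": a limit on how much text they can process at once. Sending an entire 100-page PDF in a single prompt can exceed that limit, so the request may fail or be truncated. We split data into chunks so that only the most relevant information is passed to the model.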
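
A minimal sketch of fixed-size chunking with overlap, assuming chunk size is measured in characters for simplicity (real splitters often count tokens and prefer natural boundaries). `split_text` is a hypothetical helper written for this lesson, not a library function; its `chunk_size` and `chunk_overlap` parameters mirror the names commonly used by text splitters.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_text(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

The last three characters of each chunk reappear at the start of the next one, so a thought that straddles a boundary is still fully present in at least one chunk.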

Question 1

Why is chunk_overlap considered a critical parameter when splitting documents for RAG?

To reduce the total number of tokens used by the LLM.

To ensure that semantic context (the meaning of a thought) is not cut off at the end of a chunk.

To make the vector database store data faster.

Challenge: Preserving Context

Apply your knowledge to a real-world scenario.

You are loading a YouTube transcript for a technical lecture. You notice that the search results are confusing "Lecture 1" content with "Lecture 2."

Task

Which splitter would be best for keeping context like "Section Headers" intact?

Solution:
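MarkdownHeaderTextSplitter, optionally followed by RecursiveCharacterTextSplitter. MarkdownHeaderTextSplitter splits on header lines and records each header in the chunk's metadata, which helps the retrieval system distinguish between different chapters or lectures. RecursiveCharacterTextSplitter prefers to break at paragraph and sentence boundaries, but it does not record headers on its own, so it is best used to further split sections that are still too long.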
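
The idea of keeping headers in metadata can be sketched without any library. This assumes the transcript has already been given Markdown-style "# " header lines (raw transcripts usually have none); `split_by_headers` and `make_section` are hypothetical helpers for illustration, not the MarkdownHeaderTextSplitter implementation.

```python
def make_section(header, lines):
    """Build one section dict, or return None if the body is empty."""
    body = "\n".join(lines).strip()
    if not body:
        return None
    return {"text": body, "metadata": {"header": header}}


def split_by_headers(markdown_text):
    """Split on top-level '# ' headers and attach each header as metadata."""
    sections = []
    header, lines = None, []
    for line in markdown_text.splitlines():
        if line.startswith("# "):
            # A new header closes the section collected so far.
            section = make_section(header, lines)
            if section is not None:
                sections.append(section)
            header, lines = line[2:].strip(), []
        else:
            lines.append(line)
    # Close the final section after the loop ends.
    section = make_section(header, lines)
    if section is not None:
        sections.append(section)
    return sections


transcript = (
    "# Lecture 1\n"
    "Gradient descent updates weights.\n"
    "# Lecture 2\n"
    "Attention weighs tokens.\n"
)

sections = split_by_headers(transcript)

# Restrict retrieval to one lecture by filtering on metadata.
lecture_2 = [s["text"] for s in sections if s["metadata"]["header"] == "Lecture 2"]
print(lecture_2)
```

Because every chunk carries its lecture title, a search can filter or label results by lecture instead of mixing "Lecture 1" and "Lecture 2" content.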